In [1]:
# import libraries

# ARIMA libraries
from math import sqrt
from sklearn.metrics import mean_squared_error
from sklearn import preprocessing
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.arima_model import ARIMAResults
from statsmodels.tsa.stattools import adfuller
import warnings

# Global map visualization
from bokeh.io import output_notebook, show, output_file
from bokeh.plotting import figure
from bokeh.models import GeoJSONDataSource, LinearColorMapper, ColorBar
from bokeh.palettes import brewer

# Regular libraries
from datetime import datetime, timedelta
import dateutil.parser
import geopandas as gpd
import glob
from IPython.display import Image
import json
import math
import matplotlib.pyplot as plt 
plt.rcParams['figure.figsize'] = (12,8)
import numpy as np
import pandas as pd
import seaborn as sns

Contact Information

  • Lindsay Moir

Business Understanding

Covid-19 is a global pandemic. We have over 150 countries affected and many approaches to the pandemic. It is possible to use data science to evaluate the leadership of countries by using mortality as a litmus test. This notebook, collects the data necessary for that, graphically shows the results for those countries, and then also runs a machine learning model to predict the total mortality by the end of 2020 (global not by country).

Question 1

Who are the best and worst countries in terms of per million population and wealth dealing with Covid-19?

We will use people per million and median income as our metrics for this. An extremely small country would have a relatively small number of deaths and could be totally incompetent at managing covid. Whereas a large country (by population) could do a great job and still have a large number of deaths. Also we have included the Transparency International ranking. Many countries are manipulating their data.

Question 2

Which countries have flattened or are flattening the curve?

This will be defined as the number of New_Cases per day over the last 14 days. We take the last date that data is available and we see if the number of New_Cases is going down. We will do this with a scatter plot and a regression line that is fitted to the data.

Question 3

Can I see a global geographic representation of infections?

We want to see the data on a global map. This allows us to quickly see where the hot spots are. We also have additional features that we will be visualizing on these maps.

Question 4

What is the projected global mortality as of December 31, 2020?

We will use 2 algorithms for this. One is a simply trigonometric model based on a regression line. The other is a more sophisticated ARIMA model which is ideal for predicting time series.

Data Understanding

Gather

John Hopkins Covid-19 Data

The columns change over time. I need to figure out what the super set of these column names are. Use glob and sets and a loop to create this result.

In [2]:
path = r'C:\Users\linds\OneDrive\mystuff\GitHub\COVID-19\csse_covid_19_data\csse_covid_19_daily_reports'
all_files = glob.glob(path + "/*.csv")

li_set = {}

for filename in all_files:
    df = pd.read_csv(filename)
    cols = df.columns
    cols = set(cols)
    li_set = cols.union(li_set)

print(li_set)
{'Latitude', 'Deaths', 'Long_', 'Province_State', 'Lat', 'Longitude', 'Admin2', 'Active', 'Country_Region', 'Last Update', 'FIPS', 'Confirmed', 'Country/Region', 'Last_Update', 'Combined_Key', 'Recovered', 'Province/State'}
In [3]:
# We want the LAST dfs column names
all_df_cols = df.columns
all_df_cols
Out[3]:
Index(['FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Last_Update',
       'Lat', 'Long_', 'Confirmed', 'Deaths', 'Recovered', 'Active',
       'Combined_Key'],
      dtype='object')

li_set is the superset of all column names. So, it is NOT the same as the last csv that I have. I have to do some renaming with every csv that comes in. Now that we know that, we know what column names to change and now we can bring in the entire set of files and concatenate them.

In [4]:
def short_df(df, temp_df):
    
    # Pandas really does not like '/' or ' ' in a column name so ... time to rename
    df.rename(columns={'Province/State': 'Province_State', 
                       'Country/Region': 'Country_Region',
                       'Last Update': 'Last_Update',
                       'Latitude': 'Lat',
                       'Longitude': 'Long_'}, inplace=True)
    
    # Get the current_df columns
    cols = df.columns
    
    # For loop for putting the appropriate columns in temp_df
    for col in cols:
        temp_df[col] = df[col]
        
    return temp_df
In [5]:
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    
    # Read in the next csv
    df = pd.read_csv(filename)
    
    # Get the date of the file
    date_string = filename[-14:-4]
    
    # Insert the date_string into the first column of the df
    df.insert(0, "Date_", date_string) 
    
    # If it is a full df then just append it
    if df.shape[1] == 13:
        li.append(df)
    
    else:
        # Need to build a df that is the same number of columns as the largest df (last one)
        # Create a temp_df to hold everything
        temp_df = pd.DataFrame(data=np.nan, columns=all_df_cols.insert(0, 'Date_'), index=df.index)
        
        # Call function to create the short_df
        df = short_df(df, temp_df)

        # append df to li
        li.append(df)

all_df = pd.concat(li)
all_df.shape
Out[5]:
(146031, 13)
In [6]:
# reset the index
all_df.reset_index(inplace=True, drop=True)

# Convert Date_ column to date
# Convert Date_ to datetime format
all_df['Date_'] =  pd.to_datetime(all_df['Date_'], infer_datetime_format=True)
all_df.head()
Out[6]:
Date_ FIPS Admin2 Province_State Country_Region Last_Update Lat Long_ Confirmed Deaths Recovered Active Combined_Key
0 2020-01-22 NaN NaN Anhui Mainland China 1/22/2020 17:00 NaN NaN 1.0 NaN NaN NaN NaN
1 2020-01-22 NaN NaN Beijing Mainland China 1/22/2020 17:00 NaN NaN 14.0 NaN NaN NaN NaN
2 2020-01-22 NaN NaN Chongqing Mainland China 1/22/2020 17:00 NaN NaN 6.0 NaN NaN NaN NaN
3 2020-01-22 NaN NaN Fujian Mainland China 1/22/2020 17:00 NaN NaN 1.0 NaN NaN NaN NaN
4 2020-01-22 NaN NaN Gansu Mainland China 1/22/2020 17:00 NaN NaN NaN NaN NaN NaN NaN

Population and MedianPerCapitaIncome

Now we need population and medium income data. This has been placed on disk in the local directory.

In [7]:
# Read in population and income data
pop_med_income_df = pd.read_csv(r'data/pop_med_income.csv')
pop_med_income_df.rename({"MedianPerCapitaIncome": 'MPC_Inc'}, axis=1, inplace=True)
pop_med_income_df.head()
Out[7]:
Country_Region MPC_Inc Pop2020
0 Luxembourg 18418 625978
1 Norway 19308 5421241
2 Sweden 18632 10099265
3 Australia 15026 25499884
4 Denmark 18262 5792201
In [8]:
pop_med_income_df.dtypes
Out[8]:
Country_Region    object
MPC_Inc            int64
Pop2020            int64
dtype: object

Country Codes

I had an already existing data file that I had originally gotten from World Bank https://wits.worldbank.org/wits/wits/witshelp/content/codes/country_codes.htm. I have added items to it since, the names of the countries keep on changing in the different datasets that I encounter. Eventually this should be bullet proof. Some data munging was required on this. The John Hopkins data does not have an ISO standard Country column.

In [9]:
# Now we want to add a country code
country_codes_df = pd.read_csv(r'data/country_codes_edited.csv')
country_codes_df.head()
Out[9]:
Country_Region ISO3166-1-Alpha-2 ISO3166-1-Alpha-3 ISO3166-1-numeric
0 Aruba AW ABW 533.0
1 Afghanistan AF AFG 4.0
2 Angola AO AGO 24.0
3 Anguilla AI AIA 660.0
4 Aland Islands AX ALA 248.0
In [10]:
# Lets rename the 3 letter code column
country_codes_df.rename({'ISO3166-1-Alpha-3': 'Alpha_3'}, axis=1, inplace =True)
country_codes_df.head(1)
Out[10]:
Country_Region ISO3166-1-Alpha-2 Alpha_3 ISO3166-1-numeric
0 Aruba AW ABW 533.0
In [11]:
# We only want Country_Region and Alpha-3
country_codes_df = country_codes_df[['Country_Region', 'Alpha_3']]
country_codes_df.head(1)
Out[11]:
Country_Region Alpha_3
0 Aruba ABW

Transparency International

One of the huge issues with the covid-19 data is there are lots of politics being played with the numbers by politicians. We will use the Transparency International https://www.transparency.org/cpi2019 data 2019 to give us an inkling of whether or not we can trust the numbers.

In [12]:
trans_int_df = pd.read_csv(r'data/transparency_international.csv')
trans_int_df.head()
Out[12]:
Alpha_3 TI_2019
0 DNK 87
1 NZL 87
2 FIN 86
3 SGP 85
4 SWE 85
In [13]:
# Merge the two dfs (country_codes_df and pop_med_income_df)
pop_inc_cc_codes = country_codes_df.merge(pop_med_income_df)
In [14]:
pop_inc_cc_codes.head()
Out[14]:
Country_Region Alpha_3 MPC_Inc Pop2020
0 Afghanistan AFG 378 38928346
1 Angola AGO 720 32866272
2 Albania ALB 1902 2877797
3 Argentina ARG 4109 45195774
4 Armenia ARM 926 2963243
In [15]:
# Merge the two dfs (country_codes_df and pop_med_income_df)
pop_inc_cc_codes = pop_inc_cc_codes.merge(trans_int_df)
In [16]:
pop_inc_cc_codes.head()
Out[16]:
Country_Region Alpha_3 MPC_Inc Pop2020 TI_2019
0 Afghanistan AFG 378 38928346 16
1 Angola AGO 720 32866272 26
2 Albania ALB 1902 2877797 35
3 Argentina ARG 4109 45195774 45
4 Armenia ARM 926 2963243 42

We will merge this dataframe with the all_df after we have done some aggregations with all_df. However, right now we need to go to the next step which is to describe the data.

Describe, Clean, Explore, and Verify

In [17]:
all_df.describe()
Out[17]:
FIPS Lat Long_ Confirmed Deaths Recovered Active
count 125955.000000 141123.000000 141123.000000 146012.000000 145590.000000 145643.000000 138414.000000
mean 31013.509619 36.413985 -78.290510 626.875373 40.172828 172.595552 383.303849
std 16766.870571 10.720249 45.361649 6677.799220 682.640127 3119.841434 4448.074907
min 66.000000 -51.796300 -170.132000 0.000000 0.000000 0.000000 -160761.000000
25% 18119.000000 33.669269 -96.107511 2.000000 0.000000 0.000000 0.000000
50% 29143.000000 37.851064 -87.419466 11.000000 0.000000 0.000000 2.000000
75% 46027.000000 41.597612 -80.512174 62.000000 2.000000 0.000000 29.000000
max 99999.000000 72.000000 178.065000 236899.000000 29427.000000 189791.000000 165563.000000

The above are all of the numeric columns. For our analysis we can drop FIPS, Recovered, and Active. We will keep the balance of the numeric columns. We have put a record of this in the todos in the Clean section.

In [18]:
all_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146031 entries, 0 to 146030
Data columns (total 13 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   Date_           146031 non-null  datetime64[ns]
 1   FIPS            125955 non-null  float64       
 2   Admin2          126461 non-null  object        
 3   Province_State  134751 non-null  object        
 4   Country_Region  146031 non-null  object        
 5   Last_Update     146031 non-null  object        
 6   Lat             141123 non-null  float64       
 7   Long_           141123 non-null  float64       
 8   Confirmed       146012 non-null  float64       
 9   Deaths          145590 non-null  float64       
 10  Recovered       145643 non-null  float64       
 11  Active          138414 non-null  float64       
 12  Combined_Key    138414 non-null  object        
dtypes: datetime64[ns](1), float64(7), object(5)
memory usage: 14.5+ MB

Define

Check for duplicates now.

In [19]:
# Test
all_df.duplicated().sum()
Out[19]:
0

Excellent no duplicates!

Define

We will also drop Admin2, Province_State, Last_Update, and Combined_Key. We are doing everything with Country. We have put a record of this in the todos in the Clean section. However, in order to use our time efficiently as we are describing the data, we will drop these columns now. There is no sense describing columns that you are dropping.

In [20]:
# Code

# Make a copy to avoid the inevitable Pandas warnings.
slim_df = all_df.copy(deep=True)

# Just keep the columns we want.
slim_df = slim_df[['Date_', 'Country_Region', 'Lat', 'Long_', 'Confirmed', 'Deaths']]

# Test
slim_df.head(1)
Out[20]:
Date_ Country_Region Lat Long_ Confirmed Deaths
0 2020-01-22 Mainland China NaN NaN 1.0 NaN
In [21]:
# Null status
slim_df.isnull().mean()
Out[21]:
Date_             0.000000
Country_Region    0.000000
Lat               0.033609
Long_             0.033609
Confirmed         0.000130
Deaths            0.003020
dtype: float64

Define

Add fill backwards to the clean section. There are a few missing values. Need to deal with this now so that we can look at data distribution.

In [22]:
# Code
slim_df = slim_df.bfill()
In [23]:
# Test
# Null status
slim_df.isnull().mean()
Out[23]:
Date_             0.0
Country_Region    0.0
Lat               0.0
Long_             0.0
Confirmed         0.0
Deaths            0.0
dtype: float64

Excellent no nulls!

In [24]:
slim_df.describe()
Out[24]:
Lat Long_ Confirmed Deaths
count 146031.000000 146031.000000 146031.000000 146031.000000
mean 36.255312 -74.527721 626.812259 40.296848
std 10.839424 51.888002 6677.367596 681.629019
min -51.796300 -170.132000 0.000000 0.000000
25% 33.411465 -95.755726 2.000000 0.000000
50% 37.661667 -86.854759 11.000000 0.000000
75% 41.459869 -79.946545 62.000000 2.000000
max 72.000000 178.065000 236899.000000 29427.000000

As you can tell from the .describe() method this is essentially a dataframe of outliers. We are just trying to see the spread, so ... we can be creative.

In [25]:
slim_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 146031 entries, 0 to 146030
Data columns (total 6 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   Date_           146031 non-null  datetime64[ns]
 1   Country_Region  146031 non-null  object        
 2   Lat             146031 non-null  float64       
 3   Long_           146031 non-null  float64       
 4   Confirmed       146031 non-null  float64       
 5   Deaths          146031 non-null  float64       
dtypes: datetime64[ns](1), float64(4), object(1)
memory usage: 6.7+ MB
In [26]:
# Get the numeric columns
cols = slim_df.columns
cols = cols[4:]
cols
Out[26]:
Index(['Confirmed', 'Deaths'], dtype='object')
In [27]:
# Get rid of all of the row that are all zeros
no_zeros_df = slim_df[(slim_df[cols].T != 0).all()]
print(no_zeros_df.shape)
no_zeros_df
(54411, 6)
Out[27]:
Date_ Country_Region Lat Long_ Confirmed Deaths
0 2020-01-22 Mainland China 30.975600 112.270700 1.0 17.0
1 2020-01-22 Mainland China 30.975600 112.270700 14.0 17.0
2 2020-01-22 Mainland China 30.975600 112.270700 6.0 17.0
3 2020-01-22 Mainland China 30.975600 112.270700 1.0 17.0
4 2020-01-22 Mainland China 30.975600 112.270700 26.0 17.0
... ... ... ... ... ... ...
146024 2020-05-05 Venezuela 6.423800 -66.589700 361.0 10.0
146026 2020-05-05 West Bank and Gaza 31.952200 35.233200 371.0 2.0
146028 2020-05-05 Yemen 15.552727 48.516388 22.0 4.0
146029 2020-05-05 Zambia -13.133897 27.849332 138.0 3.0
146030 2020-05-05 Zimbabwe -19.015438 29.154857 34.0 4.0

54411 rows × 6 columns

Now we can run some boxplots (below) since we will not have divide by 0 errors.

In [28]:
no_zeros_df.describe()
Out[28]:
Lat Long_ Confirmed Deaths
count 54411.000000 54411.000000 54411.000000 54411.000000
mean 34.947233 -62.868397 1656.274981 107.907574
std 12.227660 61.941420 10861.153858 1113.357032
min -51.796300 -157.971218 1.000000 1.000000
25% 32.097133 -91.183266 28.000000 1.000000
50% 36.757339 -84.193818 97.000000 3.000000
75% 41.150279 -75.139955 376.000000 11.000000
max 71.706900 174.886000 236899.000000 29427.000000
In [29]:
no_zeros_df.hist();
In [30]:
# Lets create a log10 of Confirmed and Deaths
log10_df = no_zeros_df[cols]
log10_df = np.log10(log10_df)
In [31]:
# Now plot Confirmed
plt.figure(figsize=(6, 4))
sns.boxplot(x=log10_df['Confirmed'])
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x1853c78f198>
In [32]:
# Now plot Deaths
plt.figure(figsize=(6, 4))
sns.boxplot(x=log10_df['Deaths'])
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x1853ceec5f8>

As expected, no real insightful information to be gained from this.

Summary of Cleaning Issues

all_df

  • We will need several versions of dataframes. Each question requires slightly different data to drive the result.
  • There are nulls in all_df that we have filled backwards. The reason that we fill backwards is that we think that the reporting is slow. This is a time series. Filling backwards is close to the best value that we could get if not the best.

country_codes_df

  • This has been data wrangled to successfully merge with all_df on the Country_Region column.

pop_med_income_df

  • This was data wrangled and should yield no issues while merging with all_df on the Country_Region column. Thereafter we will use the Alpha_3 column.

trans_int_df

  • This dataset is clean and does not require any wrangling.

Data Preparation

As we go thru this notebook we will be transforming the data so that we can answer the questions posed. To start lets look at some global visualizations.

In [33]:
def plot_confirmed_death(df, place):
    """Plots daily running total of confirmed cases and deaths"""

    start = df.index[0].strftime('%Y-%m-%d')
    end = df.index[-1].strftime('%Y-%m-%d')
    title = ('From {} to {} Daily Running Total Statistics For Covid-19 ({})').format(start, end, place)
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('# Of Cases and Deaths Per Day')
    plt.plot('Confirmed', data=df, color='orange', linewidth=2, label='Confirmed Cases')
    plt.plot('Deaths', data=df, linewidth=2, label='Deaths')
    plt.legend()
    plt.savefig(r'pics_final/running_totals.png');
In [34]:
def plot_mortality_rate(df, place):
    """Mortality rate plotted"""

    start = df.index[0].strftime('%Y-%m-%d')
    end = df.index[-1].strftime('%Y-%m-%d')
    title = ('From {} to {} Change in Mortality For Covid-19 ({})').format(start, end, place)
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Mortality Rate')
    plt.plot('Mortality_Rate', data=df, linewidth=2, label='Deaths')
    plt.legend()
    plt.savefig(r'pics_final/mortality_rate.png');
In [35]:
def mortality_population(df, sign):
    """Plot the Deaths_e6. This is adjusted for population. It shows what countries are doing well or poorly
    considering their Median Income and Transparency International score"""
    
    sns.set_color_codes("pastel")
    sns.barplot(x="Deaths_e6", y="Country_Region", data=df,
                label="Deaths Per Million Population", palette="Blues_d")

    # Title
    title = ('Mortality Rate Per Million Population As of {}').format(end)
    plt.xlabel('Deaths Per Million')
    plt.title(title)
    
    # Construct legend
    legend_ = f'Includes Only Countries With {sign} 50% Median Per Capita Income and\
    >= 60 on Transparency International Score'
    plt.legend([legend_])
    
    # Final housekeeping
    sns.despine(left=True, bottom=True)
    plt.savefig(r'pics_final/honest_but_no_money.png');
In [36]:
def regression_new_cases(df, place, days):
    """Number of New_Cases with Regression Line"""
    
    # Create the title
    end = df['Date_'].iloc[-1].strftime('%Y-%m-%d')
    title = ('New Cases For Last {} Days Prior To And Including {} For Covid-19 With Regression Line ({})').format(
                            days, end, place)
    plt.title(title)
    
    # We want the indexes in a list so we can place them on the x axis
    new_labels = df['Date_'].dt.strftime('%m-%d').tolist()
    plt.xticks(np.arange(len(new_labels)), new_labels)
    
    # Label the axis
    plt.xlabel('Consecutive Days')
    plt.ylabel('New_Cases Daily')
    
    # Create the plot
    sns.regplot(x=range(len(df)), y='New_Cases', data=df)
    plt.savefig(r'pics_final/regresion_mortality_rate.jpg');
In [37]:
def regression_deaths(df, place):
    """Number of Deaths with Regression Line"""
    
    # Get start and end times of df
    start = df.index[0].strftime('%Y-%m-%d')
    end = df.index[-1].strftime('%Y-%m-%d')
    
    # Format the title and labels
    title = ('From {} to {} Daily Mortality For Covid-19 ({})').format(start, end, place)
    plt.title(title)
    plt.xlabel('Days')
    plt.ylabel('# Of Deaths Per Day')
    
    # Plot regression line
    sns.regplot(x=range(len(df)), y='Deaths', data=df, label='Deaths')
    plt.savefig(r'pics_final/running_totals.png');
In [38]:
def plot_world_map(merged, values):
    """Maps features onto a global map for visualization purposes"""
    
    # unpack values
    feature, legend_string, title_string = values
    
    # Read data to json.
    merged_json = json.loads(merged.to_json())

    # Convert to String like object.
    json_data = json.dumps(merged_json)

    # Input GeoJSON source that contains features for plotting.
    geosource = GeoJSONDataSource(geojson = json_data)

    # Define a sequential multi-hue color palette.
    palette = brewer['YlGnBu'][8]

    # Reverse color order so that dark blue is highest obesity.
    palette = palette[::-1]

    # Instantiate LinearColorMapper that linearly maps numbers in a range, into a sequence of colors.
    color_mapper = LinearColorMapper(palette = palette, 
                                     low = 0, 
                                     high = 1, 
                                     nan_color = '#d9d9d9')

    # Define custom tick labels for color bar.
    tick_labels = {'0': legend_string}

    # Create color bar. 
    color_bar = ColorBar(color_mapper=color_mapper, 
                         label_standoff = 8, 
                         width = 500, height = 20,
                         border_line_color = None, 
                         location = (0,0), 
                         orientation = 'horizontal', 
                         major_label_overrides = tick_labels)

    # Create figure object.
    p = figure(title = title_string,
               plot_height = 600 , 
               plot_width = 950, 
               toolbar_location = None)

    p.xgrid.grid_line_color = None
    p.ygrid.grid_line_color = None

    # Add patch renderer to figure. 
    p.patches('xs','ys', source = geosource, fill_color = {'field' : feature, 'transform' : color_mapper},
              line_color = 'black', line_width = 0.25, fill_alpha = 1)

    # Specify figure layout.
    p.add_layout(color_bar, 'below')

    # Display figure inline in Jupyter Notebook.
    output_notebook()

    # Display figure.
    show(p)
In [39]:
def calc_angle(df, a):
    """Calculates the angle of the regression line"""
    
    # Getting the length of all 3 sides
    # a is visualized based on a regression line of April 30 2020. This may change (obviously)
    b = (df.index[-1] - df.index[0]).days # Length of the dataframe (x axis).
    c = math.sqrt(a**2 + b**2)  # calculate the hypotenuse
    
    # Calculate the tangent
    tan = math.atan(a/b)
    angle = math.degrees(tan)
    
    return angle
In [40]:
def calc_deaths(angle, start, end):
    """Calculates the number of deaths based on the angle generated from calc_angle and days calculated from dates"""
    
    # Calculate the number of days for b (x axis / adjacent)
    start = dateutil.parser.parse(start)
    end = dateutil.parser.parse(end)
    days_ = end - start
    days_ = days_.days
    
    # Calculate the number of deaths (height / opposite)
    deaths = days_ * (math.tan(math.radians(angle)))
    
    return int(deaths), days_
In [41]:
def place_value(number): 
    """Changes number to a readable number with ','s for 000s'"""
    
    return ("{:,}".format(number))
In [42]:
# Plot global mortality rate
global_df = slim_df.groupby('Date_').sum()
global_df = global_df[['Confirmed', 'Deaths']]
global_df['Mortality_Rate'] = global_df['Deaths'] / global_df['Confirmed']

# Clip outliers for mortality
global_df = global_df[global_df['Mortality_Rate'] <= .1]
plot_mortality_rate(global_df, 'Global')

It is extremely likely that the number of cases is at least 10 times higher than is being Confirmed. So, this mortality rate is likely 1/10 of the number shown here.

In [43]:
global_df.describe()
Out[43]:
Confirmed Deaths Mortality_Rate
count 9.500000e+01 95.000000 95.000000
mean 9.630820e+05 61556.557895 0.044889
std 1.158559e+06 82109.801407 0.017126
min 1.203800e+04 259.000000 0.020408
25% 7.999250e+04 2668.500000 0.033329
50% 2.427130e+05 9867.000000 0.040653
75% 1.809097e+06 111295.500000 0.061514
max 3.662691e+06 257239.000000 0.071661
In [44]:
# Plot Confirmed and Deaths in one plot
plot_confirmed_death(global_df, 'Global')
In [45]:
# Run plot for New York
new_york = all_df[all_df['Admin2'] == 'New York City'].copy(deep=True)
new_york.set_index('Date_', drop=True, inplace=True)
plot_confirmed_death(new_york, 'New york')

These are both VERY scary plots. They do not show any abatement or signs of flattening. If anything they have gone exponentially higher.

Question 1

Who are the best and worst countries dealing with Covid-19?

We need a dataframe that has these columns but we ONLY want the last date in the Covid-19 data. Then a groupby of that dataframe. Then to this dataframe we append Deaths_e6, Confirmed_e6, Median_Income, and the Transparency International score.

In [46]:
# Now merge all_df with country_codes_df.
# First of all rename Country in country_codes_df to Country_Region to make the inner join simple
covid_pop_inc_df = all_df.merge(country_codes_df)
covid_pop_inc_df.head()
Out[46]:
Date_ FIPS Admin2 Province_State Country_Region Last_Update Lat Long_ Confirmed Deaths Recovered Active Combined_Key Alpha_3
0 2020-01-22 NaN NaN Anhui Mainland China 1/22/2020 17:00 NaN NaN 1.0 NaN NaN NaN NaN CHN
1 2020-01-22 NaN NaN Beijing Mainland China 1/22/2020 17:00 NaN NaN 14.0 NaN NaN NaN NaN CHN
2 2020-01-22 NaN NaN Chongqing Mainland China 1/22/2020 17:00 NaN NaN 6.0 NaN NaN NaN NaN CHN
3 2020-01-22 NaN NaN Fujian Mainland China 1/22/2020 17:00 NaN NaN 1.0 NaN NaN NaN NaN CHN
4 2020-01-22 NaN NaN Gansu Mainland China 1/22/2020 17:00 NaN NaN NaN NaN NaN NaN NaN CHN
In [47]:
len(covid_pop_inc_df)
Out[47]:
146079
In [48]:
# Lets put covid_pop_inc_df on a diet
covid_pop_inc_df = covid_pop_inc_df[['Date_', 'Alpha_3', 'Country_Region', 'Lat', 'Long_', 'Confirmed', 'Deaths']]
In [49]:
covid_pop_inc_df.head()
Out[49]:
Date_ Alpha_3 Country_Region Lat Long_ Confirmed Deaths
0 2020-01-22 CHN Mainland China NaN NaN 1.0 NaN
1 2020-01-22 CHN Mainland China NaN NaN 14.0 NaN
2 2020-01-22 CHN Mainland China NaN NaN 6.0 NaN
3 2020-01-22 CHN Mainland China NaN NaN 1.0 NaN
4 2020-01-22 CHN Mainland China NaN NaN NaN NaN
In [50]:
pop_inc_cc_codes.head(1)
Out[50]:
Country_Region Alpha_3 MPC_Inc Pop2020 TI_2019
0 Afghanistan AFG 378 38928346 16
In [51]:
pop_inc_cc_codes.dtypes
Out[51]:
Country_Region    object
Alpha_3           object
MPC_Inc            int64
Pop2020            int64
TI_2019            int64
dtype: object
In [52]:
covid_pop_inc_df.dtypes
Out[52]:
Date_             datetime64[ns]
Alpha_3                   object
Country_Region            object
Lat                      float64
Long_                    float64
Confirmed                float64
Deaths                   float64
dtype: object
In [53]:
covid_pop_inc_df.head(1)
Out[53]:
Date_ Alpha_3 Country_Region Lat Long_ Confirmed Deaths
0 2020-01-22 CHN Mainland China NaN NaN 1.0 NaN
In [54]:
q1_df = covid_pop_inc_df.groupby(['Date_','Alpha_3'], as_index=False)['Confirmed', 'Deaths'].sum()
end = q1_df.iloc[-1,0].strftime('%Y-%m-%d')
end
C:\Users\linds\Anaconda3\envs\sklearn\lib\site-packages\ipykernel_launcher.py:1: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
  """Entry point for launching an IPython kernel.
Out[54]:
'2020-05-05'
In [55]:
q1_df = q1_df[q1_df['Date_'] == end]
print(q1_df.shape)
q1_df.head(1)
(183, 4)
Out[55]:
Date_ Alpha_3 Confirmed Deaths
11533 2020-05-05 AFG 3224.0 95.0
In [56]:
# Now we want to merge the population and median income data with this.
q1_df = q1_df.merge(pop_inc_cc_codes)
q1_df.head(1)
Out[56]:
Date_ Alpha_3 Confirmed Deaths Country_Region MPC_Inc Pop2020 TI_2019
0 2020-05-05 AFG 3224.0 95.0 Afghanistan 378 38928346 16
In [57]:
# Now convert Confirmed and Deaths to a rate per million population
q1_df['Confirmed_e6'] = q1_df['Confirmed'] / (q1_df['Pop2020'] / 1000000)
q1_df['Deaths_e6'] = q1_df['Deaths'] / (q1_df['Pop2020'] / 1000000)
q1_df.head(1)
Out[57]:
Date_ Alpha_3 Confirmed Deaths Country_Region MPC_Inc Pop2020 TI_2019 Confirmed_e6 Deaths_e6
0 2020-05-05 AFG 3224.0 95.0 Afghanistan 378 38928346 16 82.818828 2.440381
In [58]:
# Make a copy of this df
q1_df_c = q1_df.copy(deep=True)

We do not think the reporting is correct for countries that are very poor. We are restricting our analysis to just the top 50% of all countries based on MedianPerCapitaIncome AND the country must have a Transparency International rating of 60 or over. The USA has a rating of 69.

In [59]:
q1_df = q1_df[q1_df['TI_2019'] >= 60]
q1_median_income = int(q1_df['MPC_Inc'].median())
q1_df = q1_df[q1_df['MPC_Inc'] >= q1_median_income]
q1_df.sort_values('Deaths_e6', inplace=True)
print(q1_df.shape)
q1_df.head()
(14, 10)
Out[59]:
Date_ Alpha_3 Confirmed Deaths Country_Region MPC_Inc Pop2020 TI_2019 Confirmed_e6 Deaths_e6
5 2020-05-05 AUS 6875.0 97.0 Australia 15026 25499884 77 269.609070 3.803939
83 2020-05-05 NZL 1488.0 21.0 New Zealand 12147 4822233 87 308.570739 4.354829
81 2020-05-05 NOR 7955.0 215.0 Norway 19308 5421241 84 1467.376197 39.658816
37 2020-05-05 FIN 5412.0 246.0 Finland 15725 5540720 86 976.768362 44.398562
6 2020-05-05 AUT 15650.0 606.0 Austria 12284 9006398 77 1737.653610 67.285501
In [60]:
sign = '>='
mortality_population(q1_df, sign)

How many unique Alpha_3 codes (Country Codes) do we have in this dataset?

In [61]:
q1_df['Alpha_3'].nunique()
Out[61]:
14

Answer to Question 1

The above graph shows who is performing the best and the worst on the covid-19 pandemic. The only countries included in this bar graph have a Transparency International score of >= 60 and a MedianPerCapita Income >= 50% of the worlds countries. Countries with the shortest bars are doing better. Remember this IS adjusted for population. So, the fact that New Zealand is right at the top is remarkable. We should be looking at Australia and New Zealand to understand how they achieved this.

The USA is in the bottom third and the UK is at the very bottom. Perhaps a wake up call for their populations? The universe of countries that I chose for this was severely restricted by having the above cutoffs. As a result I feel this is a honest evaluation of the facts.

One sobering thought here is that there are 203 unique Alpha_3 codes in this dataset. That corresponds to 203 countries. We are only looking at 28 of them. This is because of the concern over the honesty of the data. In otherwords the challenge that we are looking at here is only the tip of the iceberg. NB There are 195 countries in the world https://www.thoughtco.com/number-of-countries-in-the-world-1433445. The John Hopkins data has odd Country_Regions such as Cruise Ships.

I also looked at only countries that are "poor" but "honest" below.

In [62]:
# Lets se how the other half lives:)
no_money_but_honest = q1_df_c[q1_df_c['TI_2019'] >= 60]
no_money_but_honest = no_money_but_honest[no_money_but_honest['MPC_Inc'] < q1_median_income]
no_money_but_honest.sort_values('Deaths_e6', inplace=True)

# Call the plotting function
sign = '<'
mortality_population(no_money_but_honest, sign)
In [63]:
print(no_money_but_honest['Country_Region'].tolist())
['Taiwan', 'Botswana', 'Singapore', 'Qatar', 'Japan', 'Uruguay', 'Chile', 'Lithuania', 'Israel', 'Estonia', 'Slovenia', 'Portugal', 'Spain', 'Belgium']

There are no obvious take aways from this list. Taiwan is an island and was very aware of the dangers due to its intimate relationship with China. Botswana likely does not have the ability to count deaths. That sounds harsh but ... no national health system. People are probably just dying at home with no testing. Japan is highly sophisticated but ... it is surprising they are on this list as having a lower Median Per Capita Income. They just missed the cutoff. Anyways, no obvious take aways from this one.

Question 2

Which countries have flattened or are flattening the curve?

We will take our universe of countries from a set that is twice as large as Q1 above and see if they are experiencing a flattening of the curve. The metric will be the last 14 days with a regression line. If the regression line is flat or going down, they are flattening the curve.

In [64]:
# groupby that creates a sum for each date for each country (Alpha_3).
q2_df = covid_pop_inc_df.groupby(['Date_','Alpha_3'], as_index=False)['Confirmed', 'Deaths'].sum()
q2_df.head()
C:\Users\linds\Anaconda3\envs\sklearn\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
  
Out[64]:
Date_ Alpha_3 Confirmed Deaths
0 2020-01-22 CHN 547.0 17.0
1 2020-01-22 HKG 0.0 0.0
2 2020-01-22 JPN 2.0 0.0
3 2020-01-22 KOR 1.0 0.0
4 2020-01-22 MAC 1.0 0.0
In [65]:
# Now we want to merge the population and median income data with this.
q2_df = q2_df.merge(pop_inc_cc_codes)
print(q2_df.shape)
q2_df.head(1)
(8370, 8)
Out[65]:
Date_ Alpha_3 Confirmed Deaths Country_Region MPC_Inc Pop2020 TI_2019
0 2020-01-22 CHN 547.0 17.0 China 1786 1439323775 41
In [66]:
# Need to add the column New_Cases
# Calculate # of New_Cases/day
q2_df['New_Cases'] = q2_df['Confirmed'] - q2_df['Confirmed'].shift(1) 
print(q2_df.shape)
q2_df.tail(5)
(8370, 9)
Out[66]:
Date_ Alpha_3 Confirmed Deaths Country_Region MPC_Inc Pop2020 TI_2019 New_Cases
8365 2020-05-01 TJK 15.0 0.0 Tajikistan 713 9537645 25 0.0
8366 2020-05-02 TJK 76.0 2.0 Tajikistan 713 9537645 25 61.0
8367 2020-05-03 TJK 128.0 2.0 Tajikistan 713 9537645 25 52.0
8368 2020-05-04 TJK 230.0 3.0 Tajikistan 713 9537645 25 102.0
8369 2020-05-05 TJK 293.0 5.0 Tajikistan 713 9537645 25 63.0
In [67]:
# Now convert New_Casess to a rate per million population
q2_df['New_Cases_e6'] = q2_df['Confirmed'] / (q2_df['Pop2020'] / 1000000)
q2_df.head(1)
Out[67]:
Date_ Alpha_3 Confirmed Deaths Country_Region MPC_Inc Pop2020 TI_2019 New_Cases New_Cases_e6
0 2020-01-22 CHN 547.0 17.0 China 1786 1439323775 41 NaN 0.38004
In [68]:
q2_df.sort_values(['Date_', 'Alpha_3'], ignore_index=True, inplace=True)
print(q2_df.shape)
q2_df.tail()
(8370, 10)
Out[68]:
Date_ Alpha_3 Confirmed Deaths Country_Region MPC_Inc Pop2020 TI_2019 New_Cases New_Cases_e6
8365 2020-05-05 USA 1204351.0 71064.0 United States 15480 331002651 69 23976.0 3638.493518
8366 2020-05-05 VNM 271.0 0.0 Vietnam 1124 97338579 37 0.0 2.784097
8367 2020-05-05 YEM 22.0 4.0 Yemen 400 29825964 15 10.0 0.737612
8368 2020-05-05 ZAF 7572.0 148.0 South Africa 1217 59308690 44 352.0 127.671004
8369 2020-05-05 ZMB 138.0 3.0 Zambia 287 18383955 34 1.0 7.506546
In [69]:
# This is the df that we need for q4. Lets make a copy of it.
q4_df = q2_df.copy(deep=True)
In [70]:
# Need the date of the last row in the df
end = q2_df.iloc[-1,0]
end
Out[70]:
Timestamp('2020-05-05 00:00:00')
In [71]:
# We want the date x days ago
days = 14
start = end - timedelta(days=days)
start
Out[71]:
Timestamp('2020-04-21 00:00:00')
In [72]:
q2_df = q2_df[q2_df['Date_'] >= start]
q2_df.shape
Out[72]:
(1812, 10)

We do not think the reporting is correct. We are restricting our analysis to countries that have a Transparency International rating >= 60. The USA has a rating of 69.

In [73]:
q2_df = q2_df[q2_df['TI_2019'] >= 60]
print(q2_df.shape)
q2_df.head()
(420, 10)
Out[73]:
Date_ Alpha_3 Confirmed Deaths Country_Region MPC_Inc Pop2020 TI_2019 New_Cases New_Cases_e6
6563 2020-04-21 AUS 6547.0 67.0 Australia 15026 25499884 77 0.0 256.746266
6564 2020-04-21 AUT 14873.0 491.0 Austria 12284 9006398 77 78.0 1651.381607
6567 2020-04-21 BEL 40956.0 5998.0 Belgium 10189 11589623 75 973.0 3533.850929
6576 2020-04-21 BWA 20.0 1.0 Botswana 740 2351627 61 0.0 8.504750
6577 2020-04-21 CAN 39402.0 1909.0 Canada 15181 37742154 77 1744.0 1043.978571
In [74]:
# Get a list with no duplicates from Country_Region
countries = q2_df['Country_Region'].tolist()
countries = (set(countries))
countries = list(countries)
countries.sort()
print(len(countries))
print(countries)
28
['Australia', 'Austria', 'Belgium', 'Botswana', 'Canada', 'Chile', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Israel', 'Japan', 'Lithuania', 'Luxembourg', 'Netherlands', 'New Zealand', 'Norway', 'Portugal', 'Qatar', 'Singapore', 'Slovenia', 'Spain', 'Sweden', 'Taiwan', 'United Kingdom', 'United States', 'Uruguay']
In [75]:
# Did not run this one because the output was too small to be of use. 
# I include it here because if I was reviewing this I would ask why I did not use FacetGrid.

# g = sns.FacetGrid(q2_df, col="Country_Region", col_wrap=4, height=2)
# g.map(sns.pointplot, 'Date_String', 'New_Cases', order=date_string, color=".3", ci=None);

We run the below regression plot in full size. In small size (above), you lose the details of whether or not the # of New_Cases is trending up or down.

In [76]:
# Run regression line plot of ONLY countries that ARE NOT flattening the curve
flat = []
still_rising = []

for country in countries:

    # Just the country of interest
    recent = q2_df[q2_df['Country_Region'] == country]
    
    # Check and see if there is any data in this df
    if len(recent) > 0:
        # Check and see if it is rising or falling

        # Find half way
        first_half = float(len(recent))
        first_half = int(first_half / 2)

        # Do the math           
        if recent['New_Cases'].iloc[:first_half].mean() > recent['New_Cases'].iloc[first_half:].mean():
            flat.append(country)
        else:
            still_rising.append(country)

    else:
        print('This country is in the countries list but ... has no data associated with it.', country)
        
    # Plot
    regression_new_cases(recent, country, days)
    
    # Create legend
    plt.legend([f'{country} Has a Transparency International Score of {recent["TI_2019"].iloc[0]}'], loc='upper center')
    plt.savefig(r'pics_final/rising_falling.jpg')
    plt.show();
In [77]:
print('flat \n', sorted(flat))
print('\nstill_rising \n', sorted(still_rising))
flat 
 ['Australia', 'Austria', 'Belgium', 'Botswana', 'Canada', 'Denmark', 'Estonia', 'Finland', 'France', 'Germany', 'Israel', 'Japan', 'Lithuania', 'Luxembourg', 'Netherlands', 'New Zealand', 'Norway', 'Portugal', 'Qatar', 'Singapore', 'Slovenia', 'Spain', 'Sweden', 'United States', 'Uruguay']

still_rising 
 ['Chile', 'Taiwan', 'United Kingdom']

Answer to Question 2

The above graphs show you the rising and falling regression lines for each country. NB some times the crude calculation that I make in the code is not as capable at recognizing the trend as the human eye. The flat list contains the countries that have daily flat to declining number of New_Cases of covid-19. Much better than I thought!

We do not cutoff based on Median Per Capita Income on this list. The only cutoff is the Transparency International score. That is why we have ~ twice as many countries being considered.

Question 3

Can I see a global geographic representation of infections?

Yes, however, we are only going to do this for the last date that we have data on. The John Hopkins dashboard at https://coronavirus.jhu.edu/map.html is an impressive interactive facility that has the ability visualize data geographically at a much greater level of sophistication than what I will be doing here. The purpose here was to append demographic information (population and Median Per Capita Income) and a "honesty score" via the Transparency International score so that we cam gain some confidence about the usefulness of this data.

In [78]:
# Identify shape file
shapefile = r'data/ne_110m_admin_0_countries/ne_110m_admin_0_countries.shp'

#Read shapefile using Geopandas
gdf = gpd.read_file(shapefile)[['ADMIN', 'ADM0_A3', 'geometry']]

#Rename columns.
gdf.columns = ['country', 'country_code', 'geometry']
gdf.head()
Out[78]:
country country_code geometry
0 Fiji FJI MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1 United Republic of Tanzania TZA POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2 Western Sahara SAH POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3 Canada CAN MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4 United States of America USA MULTIPOLYGON (((-122.84000 49.00000, -120.0000...

We can drop the row for ‘Antarctica’ as it unnecessarily occupies a large space in our map and is not required in our current analysis.

In [79]:
gdf[gdf['country'] == 'Antarctica']
Out[79]:
country country_code geometry
159 Antarctica ATA MULTIPOLYGON (((-48.66062 -78.04702, -48.15140...
In [80]:
#Drop row corresponding to 'Antarctica'
gdf = gdf.drop(gdf.index[159])
In [81]:
# groupby that creates a sum for each date for each country (Alpha_3).
q3_df = covid_pop_inc_df.groupby(['Date_','Alpha_3'], as_index=False)['Confirmed', 'Deaths'].sum()
q3_df.head()
C:\Users\linds\Anaconda3\envs\sklearn\lib\site-packages\ipykernel_launcher.py:2: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
  
Out[81]:
Date_ Alpha_3 Confirmed Deaths
0 2020-01-22 CHN 547.0 17.0
1 2020-01-22 HKG 0.0 0.0
2 2020-01-22 JPN 2.0 0.0
3 2020-01-22 KOR 1.0 0.0
4 2020-01-22 MAC 1.0 0.0
In [82]:
# Now we want to merge the population and median income data with this.
q3_df = q3_df.merge(pop_inc_cc_codes)
print(q3_df.shape)
q3_df.head(1)
(8370, 8)
Out[82]:
Date_ Alpha_3 Confirmed Deaths Country_Region MPC_Inc Pop2020 TI_2019
0 2020-01-22 CHN 547.0 17.0 China 1786 1439323775 41
In [83]:
# Restrict the analysis to just the most recent date for data
q3_df = q3_df[q3_df['Date_'] == end]
print(q3_df.shape)
q3_df.tail(1)
(122, 8)
Out[83]:
Date_ Alpha_3 Confirmed Deaths Country_Region MPC_Inc Pop2020 TI_2019
8369 2020-05-05 TJK 293.0 5.0 Tajikistan 713 9537645 25
In [84]:
q3_df.describe()
Out[84]:
Confirmed Deaths MPC_Inc Pop2020 TI_2019
count 1.220000e+02 122.000000 122.000000 1.220000e+02 122.000000
mean 2.925217e+04 2076.459016 3981.016393 5.843516e+07 44.409836
std 1.160038e+05 8073.562506 4848.304543 1.836940e+08 19.193988
min 3.000000e+00 0.000000 47.000000 4.415430e+05 13.000000
25% 5.480000e+02 11.000000 617.750000 6.009308e+06 29.250000
50% 2.834500e+03 59.500000 1790.500000 1.468904e+07 40.000000
75% 1.355200e+04 374.250000 5162.000000 4.382039e+07 57.500000
max 1.204351e+06 71064.000000 19308.000000 1.439324e+09 87.000000
In [85]:
q3_df.isnull().mean()
Out[85]:
Date_             0.0
Alpha_3           0.0
Confirmed         0.0
Deaths            0.0
Country_Region    0.0
MPC_Inc           0.0
Pop2020           0.0
TI_2019           0.0
dtype: float64

Above looks fine.

In [86]:
# Change Date to string date. Bokey does not like real dates (sigh)
q3_df['Date_'] = end.strftime('%Y-%m-%d')
q3_df.head(1)
Out[86]:
Date_ Alpha_3 Confirmed Deaths Country_Region MPC_Inc Pop2020 TI_2019
104 2020-05-05 CHN 83968.0 4637.0 China 1786 1439323775 41
In [87]:
# Add column for Deaths Per Million
q3_df['Deaths_e6'] = q3_df['Deaths'] / (q3_df['Pop2020'] / 1000000)

Need to normalize the data.

In [88]:
# Get object column names
cat_cols = q3_df.select_dtypes('object').columns
cat_cols
Out[88]:
Index(['Date_', 'Alpha_3', 'Country_Region'], dtype='object')
In [89]:
# Keep only the categorical columns
cat_cols_df = q3_df[cat_cols]
cat_cols_df.reset_index(drop=True, inplace=True)
print(cat_cols_df.shape)
cat_cols_df.head(1)
(122, 3)
Out[89]:
Date_ Alpha_3 Country_Region
0 2020-05-05 CHN China
In [90]:
# Get numeric column names
num_cols = q3_df.select_dtypes([np.number]).columns
num_cols
Out[90]:
Index(['Confirmed', 'Deaths', 'MPC_Inc', 'Pop2020', 'TI_2019', 'Deaths_e6'], dtype='object')
In [91]:
# numpy array of just the numeric values
x = q3_df[num_cols].values
x.shape
Out[91]:
(122, 6)
In [92]:
# normalize the data
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
In [93]:
# Now a numeric df
num_cols_df = pd.DataFrame(x_scaled, columns=num_cols)
print(num_cols_df.shape)
num_cols_df.head(1)
(122, 6)
Out[93]:
Confirmed Deaths MPC_Inc Pop2020 TI_2019 Deaths_e6
0 0.069718 0.065251 0.090286 1.0 0.378378 0.004658
In [94]:
# Concatenate the two dataframes.
new_df = pd.concat([cat_cols_df, num_cols_df], axis=1)
print(new_df.shape)
new_df.head(1)
(122, 9)
Out[94]:
Date_ Alpha_3 Country_Region Confirmed Deaths MPC_Inc Pop2020 TI_2019 Deaths_e6
0 2020-05-05 CHN China 0.069718 0.065251 0.090286 1.0 0.378378 0.004658
In [95]:
new_df.isnull().mean()
Out[95]:
Date_             0.0
Alpha_3           0.0
Country_Region    0.0
Confirmed         0.0
Deaths            0.0
MPC_Inc           0.0
Pop2020           0.0
TI_2019           0.0
Deaths_e6         0.0
dtype: float64
In [96]:
new_df.describe()
Out[96]:
Confirmed Deaths MPC_Inc Pop2020 TI_2019 Deaths_e6
count 122.000000 122.000000 122.000000 122.000000 122.000000 122.000000
mean 0.024286 0.029220 0.204248 0.040305 0.424457 0.060335
std 0.096321 0.113610 0.251716 0.127664 0.259378 0.159180
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.000453 0.000155 0.029632 0.003870 0.219595 0.001525
50% 0.002351 0.000837 0.090520 0.009902 0.364865 0.006503
75% 0.011250 0.005266 0.265563 0.030148 0.601351 0.032867
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

Above all looks good!

In [97]:
# Merge dataframes with left join. Avoids missing countries.
merged = gdf.merge(new_df, left_on = 'country_code', right_on = 'Alpha_3', how='left')

# Replace NaN values to string 'No data'. This is caused by the left join (obviously)
merged.fillna('No data', inplace = True)

print(merged.shape)
merged.head(5)
(177, 12)
Out[97]:
country country_code geometry Date_ Alpha_3 Country_Region Confirmed Deaths MPC_Inc Pop2020 TI_2019 Deaths_e6
0 Fiji FJI MULTIPOLYGON (((180.00000 -16.06713, 180.00000... No data No data No data No data No data No data No data No data No data
1 United Republic of Tanzania TZA POLYGON ((33.90371 -0.95000, 34.07262 -1.05982... 2020-05-05 TZA Tanzania 0.000396065 0.000225149 0.0174965 0.0412075 0.324324 0.000387265
2 Western Sahara SAH POLYGON ((-8.66559 27.65643, -8.66512 27.58948... No data No data No data No data No data No data No data No data No data
3 Canada CAN MULTIPOLYGON (((-122.84000 49.00000, -122.9742... 2020-05-05 CAN Canada 0.0524865 0.0589609 0.785733 0.0259233 0.864865 0.160509
4 United States of America USA MULTIPOLYGON (((-122.84000 49.00000, -120.0000... 2020-05-05 USA United States 1 1 0.801256 0.229735 0.756757 0.310406
In [98]:
q3_df.head(1)
Out[98]:
Date_ Alpha_3 Confirmed Deaths Country_Region MPC_Inc Pop2020 TI_2019 Deaths_e6
104 2020-05-05 CHN 83968.0 4637.0 China 1786 1439323775 41 3.221652
In [99]:
# Instantiate a dictionary with the numeric feature columns
feature_dict = dict(zip(num_cols, range(len(num_cols))))
feature_dict
Out[99]:
{'Confirmed': 0,
 'Deaths': 1,
 'MPC_Inc': 2,
 'Pop2020': 3,
 'TI_2019': 4,
 'Deaths_e6': 5}
In [100]:
# Now that we have this in a function, lets run all of the numeric columns thru this.
# Create a dictionary with the appropriate strings
feature_dict.update({num_cols[0]: ['Confirmed', 'Fewest Cases', 'Covid-19 Confirmed Cases By Country'],
                    num_cols[1]: ['Deaths', 'Fewest Deaths', 'Covid-19 Deaths By Country'],
                    num_cols[2]: ['MPC_Inc', 'Lowest', 'Median Per Capita Income By Country'],
                    num_cols[3]: ['Pop2020', 'Lowest', 'Population By Country'],
                    num_cols[4]: ['TI_2019', 'Least Honest', 'Transparency International Score By Country'],
                    num_cols[5]: ['Deaths_e6', 'Fewest Deaths', 'Covid-19 Deaths By Country Per Million Population']
                   })
In [101]:
# Prepare map for all features
for feature in num_cols:
    plot_world_map(merged, feature_dict[feature])
Loading BokehJS ...
Loading BokehJS ...
Loading BokehJS ...
Loading BokehJS ...
Loading BokehJS ...
Loading BokehJS ...

Answer to Question 3

The above images quickly tell you where the challenges are. Any country that shows color on the 'Covid-19 Deaths By Country Adjusted For Population' map is in trouble. The grey countries are not reporting data. If you look at the 'Transparency International Score By Country' (right above that one), you can immediately see the issue about reporting. Countries that are honest are reporting and as a result report higher deaths. The vast majority of countries are either not reporting or are "not honest".

Model Data

Question 4

What is the projected global mortality by December 31, 2020?

The way we are going to answer this is to look at a correlation heat map. That will instruct as to what variable would be helpful in terms of predicting deaths. Then we are simply going to do a simple regression line and extend it using trigonometry. Finally I will run an ARMIM model that predicts every day going forward to the end of 2020.

In [102]:
# Look at df just to remind me of what I have.
q4_df.head(1)
Out[102]:
Date_ Alpha_3 Confirmed Deaths Country_Region MPC_Inc Pop2020 TI_2019 New_Cases New_Cases_e6
0 2020-01-22 CHN 547.0 17.0 China 1786 1439323775 41 NaN 0.38004
In [103]:
# Look at the correlations. Should be pretty good.
sns.heatmap(df.corr(), annot=True, fmt=".2f");

The correlation is very high between Confirmed and Deaths. You can see New_Cases_e6 is also correlated (.54) with deaths. There are obvious cases of multicollinearity in the features that are in this dataset (Confirmed, New_Cases, New_Cases_e6). There is considerable uncertainty about the quality of the data that is being reported. We want to predict to the end of 2020 what will be the total number of global deaths. If we have a regression line we can do that quite simply using trigonometry. Lets do that below.

We simply do not need all of the complexity of the q4_df dataset to answer this question. The globaldf that we created earlier in the notebook works quite well for this. It is created via a groupby on Date with a sum. This yields the following very simple dataset after dropping a few columns. Remember though, for Confirmed and for Deaths the numbers are cumulative. Mortality_Rate is a simple division based on the previous 2 columns.

In [104]:
print(global_df.shape)
global_df.tail(1)
(95, 3)
Out[104]:
Confirmed Deaths Mortality_Rate
Date_
2020-05-05 3662691.0 257239.0 0.070232

Trigonometry

We are going to find the angle (bottom left hand side) then calculate the Opposite. The Adjacent is the difference in days between Feb 1 2020 and Dec 31 2020. In order to get the initial angle we are using the above regression line and creating a right triangle to work from.

In [105]:
regression_deaths(global_df, 'Global')
In [106]:
# Estimate the total number of deaths by the end of the year

# Find the angle based on the above regression line.
angle = calc_angle(global_df, 230000)

# Time period
start = '2020-02-01'
end = '2020-12-31'

# Estimate year end deaths
deaths, days = calc_deaths(angle, start, end)
print(f'The number of expected deaths forecast to occcur from Covid-19 by {end} is: {place_value(deaths)}.')
print(f'This is over the next {days} days.')
The number of expected deaths forecast to occcur from Covid-19 by 2020-12-31 is: 817,234.
This is over the next 334 days.

ARIMA Model

I have a number of features that we could use to predict mortality. However, the obvious one is to simply use Death. This machine learning algorithm is well suited to univariate time series.

Create and summarize a stationary version of the time series.

In [107]:
# create a differenced series
def difference(dataset):
    """Create a difference between first value and the second value in the input"""
    
    diff = []
    for i in range(1, len(dataset)):
        value = dataset[i] - dataset[i - 1]
        diff.append(value)

    return pd.Series(diff)
In [108]:
# Load the data 
series = global_df['Deaths']
series.head()
Out[108]:
Date_
2020-02-01    259.0
2020-02-02    362.0
2020-02-03    426.0
2020-02-04    492.0
2020-02-05    564.0
Name: Deaths, dtype: float64
In [109]:
# Convert it to an array
X = series.values
X = X.astype('float32') 
In [110]:
# difference data
stationary = difference(X)
stationary.index = series.index[1:]
In [111]:
# check if stationary
result = adfuller(stationary)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))
ADF Statistic: -1.215171
p-value: 0.667036
Critical Values:
	1%: -3.514
	5%: -2.898
	10%: -2.586
In [112]:
# plot differenced data
stationary.plot()
plt.show();

That is not stationary!. That is not good. We can get the ARIMA model to grid search a solution to this.

In [113]:
# save
stationary.to_csv(r'data/stationary.csv')
In [114]:
# ACF and PACF plots of the time series
series = global_df['Deaths']
plt.figure(figsize=(16,6))
plt.subplot(211)
plot_acf(series, ax=plt.gca())
plt.subplot(212)
plot_pacf(series, ax=plt.gca())
plt.show()
  • The ACF shows significant lags to 7 time steps.
  • The PACF shows significant lags to 1 time step.

A good starting point for the p is 7 and q as 1. This quick analysis suggests an ARIMA(7,1,1) on the raw data may be a good starting point. I ran it. It did not converge. So, lets evaluate a series of ARIMA models with a try and except block to avoid "LinAlgError: SVD did not converge"

In [115]:
# grid search ARIMA parameters for a time series

def evaluate_arima_model(X, arima_order):
    """evaluate an ARIMA model for a given order (p,d,q) and return RMSE"""    
    
    # prepare training dataset
    X = X.astype('float32')
    train_size = int(len(X) * 0.50)
    train, test = X[0:train_size], X[train_size:]
    history = [x for x in train]
    
    # make predictions
    predictions = []
    for t in range(len(test)):
        model = ARIMA(history, order=arima_order)
        
        # fit model
        try:
            model_fit = model.fit(trend='nc', disp=0)
            yhat = model_fit.forecast()[0]
            predictions.append(yhat)
            history.append(test[t])
        except:
            continue
        
    # calculate out of sample error
    rmse = sqrt(mean_squared_error(test, predictions))
    return rmse
In [116]:
def evaluate_models(dataset, p_values, d_values, q_values):
    """evaluate combinations of p, d and q values for an ARIMA model"""    
    
    dataset = dataset.astype('float32')
    best_score, best_cfg = float("inf"), None
    for p in p_values:
        for d in d_values:
            for q in q_values:
                order = (p,d,q)
                try:
                    rmse = evaluate_arima_model(dataset, order)
                    if rmse < best_score:
                        best_score, best_cfg = rmse, order
                    print('ARIMA%s RMSE=%.3f' % (order,rmse))
                except:
                    continue
                   
    print('Best ARIMA%s RMSE=%.3f' % (best_cfg, best_score))
    
    return best_cfg
In [117]:
# load dataset
series = global_df['Deaths']

# evaluate parameters
p_values = range(0, 8)
d_values = range(0, 3)
q_values = range(0, 3)
warnings.filterwarnings("ignore")
best_cfg = evaluate_models(series.values, p_values, d_values, q_values)
ARIMA(0, 0, 1) RMSE=72821.941
ARIMA(0, 0, 2) RMSE=37909.589
ARIMA(0, 1, 1) RMSE=3524.670
ARIMA(0, 2, 1) RMSE=1496.945
ARIMA(0, 2, 2) RMSE=1659.622
ARIMA(1, 0, 0) RMSE=5620.410
ARIMA(1, 1, 0) RMSE=1439.814
ARIMA(1, 2, 0) RMSE=1510.875
ARIMA(2, 1, 0) RMSE=1507.964
ARIMA(2, 2, 0) RMSE=1557.698
ARIMA(3, 1, 0) RMSE=1552.893
ARIMA(3, 2, 0) RMSE=1539.242
ARIMA(5, 2, 0) RMSE=1507.360
ARIMA(6, 2, 0) RMSE=1496.843
Best ARIMA(1, 1, 0) RMSE=1439.814

As we can see from the above, a LOT of these parameters resulted in the model not being able to fit it.

In [118]:
# summarize residual errors for an ARIMA model

# load data
series = global_df['Deaths']

# prepare data
X = series.values
X = X.astype('float32')
train_size = int(len(X) * 0.50)
train, test = X[0:train_size], X[train_size:]

# walk-forward validation
history = [x for x in train]
predictions = []
for i in range(len(test)):
    
    # predict
    model = ARIMA(history, order=best_cfg)
    model_fit = model.fit(trend='nc', disp=0)
    yhat = model_fit.forecast()[0]
    predictions.append(yhat)
    
    # observation
    obs = test[i]
    history.append(obs)

# errors
residuals = [test[i]-predictions[i] for i in range(len(test))]
residuals = pd.DataFrame(residuals)
print(residuals.describe())

# plot
plt.figure()
plt.subplot(211)
residuals.hist(ax=plt.gca())
plt.subplot(212)
residuals.plot(kind='kde', ax=plt.gca())
plt.show();

# Save mean for bias adjustment below
bias = residuals.describe()
bias = bias.iloc[1][0]
print('\nbias saved in bias for subsequent run is:', bias, '\n')
                 0
count    48.000000
mean    158.490642
std    1446.208193
min   -4607.224606
25%    -307.116714
50%     239.033007
75%     724.419244
max    4615.354485
bias saved in bias for subsequent run is: 158.4906419577184 

The histogram shows that the distribution has a right skew and that the mean is non-zero. This is perhaps a sign that the predictions are biased. The distribution of residual errors is also plotted. The graphs suggest a Gaussian-like distribution with a longer left tail. Lets see if the bias made any real difference.

In [119]:
# summarize residual errors from bias corrected forecasts

# load data
series = global_df['Deaths']

# prepare data
X = series.values.astype('float32')
train_size = int(len(X) * 0.50)
train, test = X[0:train_size], X[train_size:]

# walk-forward validation
history = [x for x in train]
predictions = list()
bias = bias # from above run
for i in range(len(test)):
    
    # predict
    model = ARIMA(history, order=best_cfg)
    model_fit = model.fit(trend='nc', disp=0)
    yhat = bias + float(model_fit.forecast()[0])
    predictions.append(yhat)
    
    # observation
    obs = test[i]
    history.append(obs)

# report performance
rmse = sqrt(mean_squared_error(test, predictions))
print('RMSE: %.3f' % rmse)

# summarize residual errors
residuals = [test[i]-predictions[i] for i in range(len(test))]
residuals = pd.DataFrame(residuals)
print(residuals.describe())

# plot
# histogram
plt.figure()
plt.subplot(211)
residuals.hist(ax=plt.gca())

# density
plt.subplot(212)
residuals.plot(kind='kde', ax=plt.gca())
plt.show()
RMSE: 1431.064
                  0
count  4.800000e+01
mean  -4.964325e-12
std    1.446208e+03
min   -4.765715e+03
25%   -4.656074e+02
50%    8.054236e+01
75%    5.659286e+02
max    4.456864e+03

The bias had almost no effect on the RMSE. Considering that we will be looking at deaths in the range of 6 to 7 figures, not surprising that this made so little difference.

The summary of the forecast residual errors shows that the mean was indeed moved to a value very close to zero. Finally, density plots of the residual error do show a small shift towards zero. Lets take a look at test vs predictions.

In [120]:
# Assemble title
title = ('Model Prediction of Test vs Predictions For Covid-19 ({}) For {} Days').format('Global', len(test))
plt.title(title)

# Create x and y axis labels
plt.xlabel('Days')
plt.ylabel('Cumulative Deaths By Day')

# Create plot
plt.plot(test, linewidth=2, label='Test')
plt.plot(predictions, linewidth=2, label='Predictions')
plt.legend()
plt.savefig(r'pics_final/test_vs_prediction.png');

Wow hard to believe that this is that accurate!

In [123]:
# save finalized model to file

def __getnewargs__(self):
    """# monkey patch around bug in ARIMA class"""
    return ((self.endog),(self.k_lags, self.k_diff, self.k_ma))

ARIMA.__getnewargs__ = __getnewargs__

# load data
series = global_df['Deaths']

# prepare data
X = series.values.astype('float32')

# fit model
model = ARIMA(X, order=best_cfg)
model_fit = model.fit(trend='nc', disp=0)

# bias constant, could be calculated from in-sample mean residual
bias = bias

# save model
model_fit.save(r'data/model.pkl')
np.save(r'data/model_bias.npy', [bias])
In [124]:
# load finalized model and make a prediction

# Load model
model_fit = ARIMAResults.load(r'data/model.pkl')
bias = np.load(r'data/model_bias.npy')

# Pick a future to predict. 0 means literally tomorrow.
yhat = bias + float(model_fit.forecast()[0])
print('Predicted: %.0f' % yhat)
Predicted: 262909

This proves the model was saved, loaded successfully, and performed a prediction. Lets forecast every day between now and December 31, 2020. We are also going to use ALL of the data since we have now put the model into "production".

In [125]:
global_df.shape
Out[125]:
(95, 3)
In [126]:
def difference(dataset, interval=1):
    """create a differenced series"""
    
    diff = []
    for i in range(interval, len(dataset)):
        value = dataset[i] - dataset[i - interval]
        diff.append(value)
        
    return np.array(diff)


def inverse_difference(history, yhat, interval=1):
    """invert differenced value"""
    return yhat + history[-interval]

# load dataset
series = global_df['Deaths']

# This algorithm accommodates seasonal variation which may be helpful if there is a season to covid.
# Flu season runs from November thru March with a peak in December and February each year.
# This is for the northern hemisphere. Obviously the opposite for the southern hemisphere.
# To get the maximum information into the model currently use length /2.
X = series.values
period = int(X.shape[0]/2)
differenced = difference(X, period)

# fit model
model = ARIMA(differenced, order=(best_cfg))
model_fit = model.fit(disp=0)

# forecast Dec 31 2020
start = global_df.index[-1]
end = pd.to_datetime('2021-01-01', infer_datetime_format=True)
steps = end - start
steps = steps.days

# Fit the out of sample
forecast = model_fit.forecast(steps=steps)[0]

# invert the differenced forecast to something usable
history = [x for x in X]
inverted_ = []

for day, yhat in enumerate(forecast):
    
    inverted = inverse_difference(history, yhat, period)
    history.append(inverted)
    inverted_.append(int(inverted))
    
# Create a df for reporting
# Create a date_range index
start = global_df.index[-1]
end = start + timedelta(days=steps-1)

# Create the df
forecast_df = pd.DataFrame({'Deaths': inverted_}, index=pd.date_range(start=start, end=end))

# Shift Deaths by one day to make it lineup correctly with the date. 
forecast_df['Deaths'] = forecast_df['Deaths'].shift(1)
forecast_df = forecast_df[1:]
forecast_df.tail(1)
Out[126]:
Deaths
2020-12-31 5020676.0

That is a pretty big number. Lets check and see if it is reasonable. We will look at global_df and calculate its daily percent increase. Then we will look at forecast_df and see if it is comparable.

In [127]:
# Take a copy of global_df
global_df_c = global_df.copy(deep=True)

# Drop the extra columns in global_df_c compared to report_df
global_df_c = global_df_c[['Deaths']]

# Create the extra columns
global_df_c['Increase'] = global_df_c['Deaths'] - global_df_c['Deaths'].shift(1)
global_df_c['Percent_Increase'] = (global_df_c['Increase'] / global_df_c['Deaths']) * 100
global_df_c['Percent_Increase'] = round(global_df_c['Percent_Increase'], 2)
global_df_c['Deaths'] = global_df_c['Deaths'].astype(np.int)

# Look at it
global_df_c.tail()
Out[127]:
Deaths Increase Percent_Increase
Date_
2020-05-01 238650 5262.0 2.20
2020-05-02 243808 5158.0 2.12
2020-05-03 247470 3662.0 1.48
2020-05-04 251537 4067.0 1.62
2020-05-05 257239 5702.0 2.22
In [128]:
# Look at daily percent increase for last 5 days
forecast_df['Increase'] = forecast_df['Deaths'] - forecast_df['Deaths'].shift(1)
forecast_df['Percent_Increase'] = (forecast_df['Increase'] / forecast_df['Deaths']) * 100
forecast_df['Percent_Increase'] = round(forecast_df['Percent_Increase'] ,2)
forecast_df['Deaths'] = forecast_df['Deaths'].astype(np.int)
forecast_df.head()
Out[128]:
Deaths Increase Percent_Increase
2020-05-06 263289 NaN NaN
2020-05-07 269621 6332.0 2.35
2020-05-08 275959 6338.0 2.30
2020-05-09 282545 6586.0 2.33
2020-05-10 289393 6848.0 2.37
In [129]:
print('These means all have to do with the day to day percentage increase.\n')
print(f'Last 14 days actual data mean is: {round(global_df_c.Percent_Increase.iloc[-14:].mean(), 2)}')
print(f'First 14 days forecast mean is: {round(forecast_df.Percent_Increase.iloc[:14].mean(), 2)}')
print(f'Actual data mean is: {round(global_df_c.Percent_Increase.mean(), 2)}')
print(f'Forecast mean is: {round(forecast_df.Percent_Increase.mean(), 2)}')
These means all have to do with the day to day percentage increase.

Last 14 days actual data mean is: 2.65
First 14 days forecast mean is: 2.52
Actual data mean is: 6.96
Forecast mean is: 1.22

If anything the ARIMA model is being MORE conservative than the data would suggest.

In [130]:
predicted = place_value(int(forecast_df["Deaths"].iloc[-1]))
forecast_date = forecast_df.index[-1].strftime('%Y-%m-%d')
print(f'The prediction is for {predicted} cumulative deaths to occur by {forecast_date}')
The prediction is for 5,020,676 cumulative deaths to occur by 2020-12-31
In [131]:
# Lets visualize that.

# Add a new column that is Deaths_e6 (deaths per million)
forecast_df['Deaths_e6'] = forecast_df['Deaths'] / 1000000

# Assemble title
start = forecast_df.index[0].strftime('%Y-%m-%d')
title = ('{} Forecast Cumulative Deaths In Millions For Covid-19 ({} to {})').format('Global', start, forecast_date)
plt.title(title)

# Create x and y axis labels
plt.xlabel('Date')
plt.ylabel('Cumulative Deaths In Millions')

# Create plot
plt.plot('Deaths_e6', data=forecast_df, linewidth=4, label='Deaths')
plt.legend()
plt.savefig(r'pics_final/arima_prediction.png');

ARIMA thinks that this is going to keep on going. I can certainly see why. There is nothing in the dataset that says it will do anything other than what it is doing here, which is heading up and to the right ver rapidly.

Answer to Question 4

We developed 2 models. The first was based on trigonometry. This yielded an answer of approximately 817,000 people will die by the end of 2020. We also created a more sophisticated ARIMA model. This predicts several million people will not make it to 2021. The model was incredibly accurate in its test vs prediction scores. I think this is because the math of a pandemic is so simple. It is easy for the model to learn it. Lets hope that neither model is correct. However, I have a very bad feeling.

Results and Conclusions

We answered 4 questions in this notebook. Please click on these links to look at those answers.

  • Answer To Q1
  • Answer To Q2
  • Answer To Q3
  • Answer To Q4
  • 24 of 28 countries (that met the cutoffs) have a flattening of the curve presently. That is VERY encouraging.

    It is surprising to see rich countries struggle with this virus. There is a saying that democracies get the type of leadership that they deserve. Hopefully this is a wake up call that electing incompetent politicians with massive character flaws to the highest office in the land can and does cause death. This time those deaths are being counted. They are not hidden in Syria, Afghanistan, etc. They are right in the home countries that have weak leadership.

    Including the Median Per Capita Income and the Transparency International score was helpful in terms of creating cutoffs that made sense.

    One sobering thought here is that there are 203 unique Alpha_3 codes in this dataset. That corresponds to 203 Cpountry_Regions. We are only looking at 28 of them. This is because of the concern over the honesty of the data. In otherwords the challenge that we are looking at here is only the tip of the iceberg.

    Limitations

    Generally speaking I believe the reporting on the John Hopkins data is extremely suspect. This is due to political and on the ground issues. Politicians do not want to be accused of doing less than a great job, especially when it is obvious they are doing anything but a great job. Many countries simply do not have the infrastructure, systems, etc. to care for the sick and to report. People are dying and they are not reporting those statistics. Cases are happening and they do not have testing. It is likely that the only reasonable data that you will get and it will be still be grossly under reported in terms of actual cases will be from the western democracies. However, the Transparency International score I have included gives me some confidence for all of the findings in this notebook.

    Deploy

    Update the John Holkins, Covid-19 data https://github.com/CSSEGISandData/COVID-19 by refreshing your local copy of the GitHub repository. The ARIMA model has been hyper parameter tuned, stored on disk and is ready to go. There is a forecasting module included for this dataset that can be ran at any time. It remains to be seen whether or not its predictions will be accurate. For additonal information please refer to the ReadMe.

    In [ ]: